FP-GROWTH APPROACH FOR DOCUMENT CLUSTERING by
Abstract
Since the amount of text data stored in computer repositories is growing every day, we need more than ever a reliable way to group or categorize text documents. Most existing document clustering techniques cluster documents using a group of keywords from each document. In this thesis, we use a sense-based approach to cluster documents instead of relying only on keyword frequency: we use the relationships between the keywords. The relationships are retrieved from the WordNet ontology and represented in the form of a graph. The resulting document-graphs, which reflect the essence of the documents, are searched to find frequent subgraphs. To discover the frequent subgraphs, we use the Frequent Pattern Growth (FP-growth) approach, which was originally designed to discover frequent patterns in transaction databases. The common frequent subgraphs discovered by the FP-growth approach are later used to cluster the documents. The FP-growth approach requires the creation of an FP-tree. Mining an FP-tree built from a conventional transaction database is easier than mining one built from large document-graphs, mostly because the itemsets in a transaction database are smaller than the edge lists of our document-graphs. The original FP-tree mining procedure is also easier because the items of a traditional transaction database are stand-alone entities with no direct connection to each other. In contrast, because we look for subgraphs, the edges of a document-graph are related to each other through connectivity. This computation cost makes the original FP-growth approach somewhat inefficient for text documents. We modify the FP-growth approach, making it possible to generate frequent subgraphs from the FP-tree. Later, we cluster the documents using these subgraphs.
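To make the idea concrete, the following is a minimal sketch of standard FP-growth applied to edge lists, and not the modified procedure developed in the thesis. It assumes each document-graph is given as a list of undirected edges (pairs of concept labels), treats every edge as an item, and simply discards frequent edge sets that are not connected after mining. The sample documents, the helper names (`build_fp_tree`, `fp_growth`, `is_connected`) and the support threshold are all hypothetical.

```python
from collections import Counter, defaultdict


class FPNode:
    """One node of the FP-tree: an item (here an edge) with a count."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Support of an item = number of transactions that contain it.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all tree nodes holding that item
    for t in transactions:
        # Keep frequent items only, most frequent first (ties broken by item).
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, frequent


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Yield (itemset, support) for every frequent itemset."""
    _, header, frequent = build_fp_tree(transactions, min_support)
    for item in sorted(frequent, key=lambda i: (frequent[i], i)):
        pattern = suffix | {item}
        yield pattern, frequent[item]
        # Conditional pattern base: the prefix path of each node of `item`,
        # repeated once per occurrence counted at that node.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            conditional.extend([path] * node.count)
        yield from fp_growth(conditional, min_support, pattern)


def is_connected(edges):
    """True if the undirected edges form a single connected subgraph."""
    edges = list(edges)
    if not edges:
        return False
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == len(adjacency)


# Hypothetical document-graphs, each an edge list over concept labels.
documents = [
    [("animal", "dog"), ("animal", "cat"), ("entity", "animal")],
    [("animal", "dog"), ("entity", "animal"), ("plant", "tree")],
    [("animal", "dog"), ("plant", "tree"), ("entity", "plant")],
]

frequent_edge_sets = dict(fp_growth(documents, min_support=2))
# Naive post-filter (an assumption of this sketch): keep connected sets only.
frequent_subgraphs = {edges: support
                      for edges, support in frequent_edge_sets.items()
                      if is_connected(edges)}
```

In this sketch the edge set containing both ("animal", "dog") and ("plant", "tree") is frequent as an itemset but is dropped because its edges share no vertex. Generating such disconnected candidates and filtering them afterwards is wasted work, which illustrates why connectivity makes plain FP-growth inefficient on document-graphs.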